Guiding / Research Question: Which injuries are most common in the National Basketball Association (NBA), and which injuries, on average, cause players to miss the most time while recovering? Also, what role does rest play in preventing, and sometimes contributing to, injury?
Include R code and written explanation to import your TWO data sets.
# Load Libraries
# Install only the packages that are missing; re-installing on every knit is slow and unnecessary.
# ggplot2 ships with tidyverse, and gridExtra is needed later for grid.arrange().
pkgs <- c("tidyverse", "rvest", "ggraph", "gridExtra")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs, repos = "http://cran.us.r-project.org")
library(stringr)
## Warning: package 'stringr' was built under R version 4.1.2
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggraph)
## Warning: package 'ggraph' was built under R version 4.1.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
## Warning: package 'rvest' was built under R version 4.1.2
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(dplyr)
This dataset contains all NBA injuries from 2010 to 2020, as tracked by players and teams. Teams collect the data to produce the injury reports that feed the official NBA report. It was originally gathered to track how many players were getting injured and how, i.e. which injuries were most prevalent among players.
injuries <- read_csv(file="injuries_2010-2020.csv")
## Rows: 27105 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Team, Acquired, Relinquished, Notes
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(injuries)
## spc_tbl_ [27,105 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date : Date[1:27105], format: "2010-10-03" "2010-10-06" ...
## $ Team : chr [1:27105] "Bulls" "Pistons" "Pistons" "Blazers" ...
## $ Acquired : chr [1:27105] NA NA NA NA ...
## $ Relinquished: chr [1:27105] "Carlos Boozer" "Jonas Jerebko" "Terrico White" "Jeff Ayres" ...
## $ Notes : chr [1:27105] "fractured bone in right pinky finger (out indefinitely)" "torn right Achilles tendon (out indefinitely)" "broken fifth metatarsal in right foot (out indefinitely)" "torn ACL in right knee (out indefinitely)" ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_date(format = ""),
## .. Team = col_character(),
## .. Acquired = col_character(),
## .. Relinquished = col_character(),
## .. Notes = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(injuries)
sample_n(injuries, 10)
nrow(injuries)
## [1] 27105
This dataset contains documented NBA injuries from the 1951 season through the 2023 season, as tracked by the teams. Teams collected the data to create injury reports. Keep in mind that medicine and sports science were different in the 1950s, 60s, and 70s, which could have prolonged recovery times for similar injuries.
oldInjuries <- read_csv(file="NBA_Player_Injury_Stats1951-2023.csv")
## New names:
## • `` -> `...1`
## Rows: 37667 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Team, Acquired, Relinquished, Notes
## dbl (1): ...1
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(oldInjuries)
## spc_tbl_ [37,667 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:37667] 0 1 2 3 4 5 6 7 8 9 ...
## $ Date : Date[1:37667], format: "1951-12-25" "1952-12-26" ...
## $ Team : chr [1:37667] "Bullets" "Knicks" "Knicks" "Lakers" ...
## $ Acquired : chr [1:37667] NA NA NA NA ...
## $ Relinquished: chr [1:37667] "Don Barksdale" "Max Zaslofsky" "Jim Baechtold" "Elgin Baylor" ...
## $ Notes : chr [1:37667] "placed on IL" "placed on IL with torn side muscle" "placed on inactive list" "player refused to play after being denied a room in team's hotel" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. Date = col_date(format = ""),
## .. Team = col_character(),
## .. Acquired = col_character(),
## .. Relinquished = col_character(),
## .. Notes = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(oldInjuries)
sample_n(oldInjuries, 10)
nrow(oldInjuries)
## [1] 37667
Include R code and written explanation for wrangling your data (you can make multiple wrangled data sets).
This returns the most common injury notes in the 2010-2020 NBA injuries dataset, ordered from most to least common.
injuries %>%
count(Notes) %>%
arrange(desc(n))
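Counting raw Notes strings treats "sprained left ankle" and "sprained right ankle" as different injuries. One way to get body-part totals is to bucket each note by keyword first. The sketch below is only an illustration: the function name `classify_injury`, the keyword list, and the "other" bucket are assumptions, not part of the original analysis, and the sample notes are a small hand-made vector rather than the real data.

```r
# Hypothetical helper: assign each note to the first keyword it matches.
classify_injury <- function(notes) {
  n <- tolower(notes)
  # Assumed keyword list; extend it to cover the injuries you care about.
  patterns <- c(ankle = "ankle", hamstring = "hamstring", calf = "calf",
                achilles = "achilles", knee = "knee|\\bacl\\b")
  out <- rep(NA_character_, length(n))
  for (label in names(patterns)) {
    hit <- is.na(out) & grepl(patterns[[label]], n)
    out[hit] <- label
  }
  out[is.na(out)] <- "other"  # no keyword matched (or the note was NA)
  out
}

sample_notes <- c("sprained left ankle (DTD)",
                  "sprained right ankle (DTD)",
                  "torn ACL in right knee (out indefinitely)",
                  "torn right Achilles tendon (out indefinitely)",
                  "returned to lineup")
injury_counts <- sort(table(classify_injury(sample_notes)), decreasing = TRUE)
print(injury_counts)
```

On the real data the same function could be applied to injuries$Notes before counting, for example inside mutate().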
This returns the records in the data that are not associated with a team.
injuries %>% filter(is.na(Team))
This returns the injuries recorded for players who were acquired from other teams, either through free agency or trades, ordered from most to least common.
injuries %>% filter(!is.na(Acquired)) %>% count(Notes) %>% arrange(desc(n))
View data from 1951-2023 NBA injuries dataset
glimpse(oldInjuries)
## Rows: 37,667
## Columns: 6
## $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Date <date> 1951-12-25, 1952-12-26, 1956-12-29, 1959-01-16, 1961-11-…
## $ Team <chr> "Bullets", "Knicks", "Knicks", "Lakers", "Lakers", "Laker…
## $ Acquired <chr> NA, NA, NA, NA, NA, "Elgin Baylor", "Elgin Baylor", NA, "…
## $ Relinquished <chr> "Don Barksdale", "Max Zaslofsky", "Jim Baechtold", "Elgin…
## $ Notes <chr> "placed on IL", "placed on IL with torn side muscle", "pl…
Create a year column in both injuries and oldInjuries. For injuries, also create year-month, month, and weekday columns from Date, plus a flag for notes that mention rest.
injuries$year <- year(injuries$Date)
oldInjuries$year <- year(oldInjuries$Date)
injuries$yr_mo <- format(injuries$Date, "%Y-%m")
injuries$month <- month(injuries$Date, label = TRUE)
injuries$day <- weekdays(injuries$Date)
injuries$rest <- str_detect(injuries$Notes, "rest")
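A plain substring match on "rest" also fires on any longer word that contains those letters. If that turns out to matter in this data, a word-boundary pattern is one possible tightening. The sketch below uses base R's grepl() on made-up notes; the helper name `is_rest` is an assumption for illustration only.

```r
# Hypothetical helper: TRUE only when "rest" appears as a whole word.
is_rest <- function(notes) {
  grepl("\\brest\\b", notes, ignore.case = TRUE)
}

# Made-up notes to show the difference; these are not taken from the dataset.
toy_notes <- c("rest (DTD)", "placed on IL for rest", "restricted list", "arrested", NA)

loose  <- grepl("rest", toy_notes, fixed = TRUE)  # plain substring match
strict <- is_rest(toy_notes)                      # whole-word match

print(data.frame(note = toy_notes, loose = loose, strict = strict))
```

Note that grepl() returns FALSE for a missing note, whereas str_detect() returns NA, so the two approaches differ in how rows with no Notes value are counted.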
Count the total number of injuries recorded in the NBA in each calendar year from 1975 onward.
oldInjuries %>%
filter(year >= 1975) %>%
group_by(year) %>%
summarise(oldInjuries = n())
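The grouping above is by calendar year, while an NBA season spans two calendar years. If season-level totals are wanted, each date can be mapped to the year its season started. This is a rough sketch under a stated assumption: seasons are taken to begin in October, which does not hold for every season (start dates have shifted in some years). The function name `season_start_year` and the `start_month` parameter are hypothetical.

```r
# Hypothetical helper: map a date to the calendar year in which its season began.
# Assumes seasons start in October (start_month = 10) unless told otherwise.
season_start_year <- function(date, start_month = 10) {
  y <- as.integer(format(date, "%Y"))
  m <- as.integer(format(date, "%m"))
  ifelse(m >= start_month, y, y - 1L)
}

example_dates <- as.Date(c("1975-11-02", "1976-03-15", "1976-10-20"))
season_start_year(example_dates)
# The first two dates fall in the season that began in 1975, the third in 1976.
```

Grouping by this value instead of year would give one row per season rather than one per calendar year.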
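Merge the two datasets on their five shared columns. Because merge() performs an inner join by default, the result keeps only the records that appear in both datasets, so the post-2010 injuries can be inspected team by team alongside their matching rows in the historical data.
injuries_data <-
merge(x = injuries,
y = oldInjuries,
# the key columns have the same names in both data frames, so a single `by` is enough
by = c("Team", "Notes", "Acquired", "Relinquished", "Date"))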
head(injuries_data)
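The behavior of merge() on shared key columns can be seen on a small example. The two toy data frames below are made up for illustration and use placeholder player names. The sketch shows that the default keeps only rows present in both inputs, that all = TRUE keeps every row, and that a non-key column present in both inputs (such as year) comes back with .x and .y suffixes.

```r
# Toy data frames with made-up rows, shaped loosely like the two injury tables.
a <- data.frame(Team = c("Bulls", "Pistons"),
                Relinquished = c("Player A", "Player B"),
                Date = as.Date(c("2010-10-03", "2010-10-06")),
                year = c(2010, 2010))
b <- data.frame(Team = c("Bulls", "Lakers"),
                Relinquished = c("Player A", "Player C"),
                Date = as.Date(c("2010-10-03", "1959-01-16")),
                year = c(2010, 1959))

keys <- c("Team", "Relinquished", "Date")

both   <- merge(a, b, by = keys)              # inner join: rows found in both
either <- merge(a, b, by = keys, all = TRUE)  # full join: rows found in either

nrow(both)    # 1
names(both)   # "Team" "Relinquished" "Date" "year.x" "year.y"
nrow(either)  # 3
```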
Reshape oldInjuries into wide format, with one column per team holding the date of each record, to show the players who have been acquired and relinquished due to injury since the 1951 NBA season.
oldInjuries %>% pivot_wider(names_from = "Team", values_from = "Date")
Include R code and written explanation for your data visualization(s) with at least 3 variables. You must have at least one graph. You may have more than one (I would encourage you to have more than one).
This graph shows the number of injuries recorded for each NBA team from 2010 to 2020.
injuries %>%
filter(!is.na(Team)) %>%
count(Team) %>%
ggplot(aes(x=reorder(Team,n), y=n)) +
geom_col(fill = "red", color = "white") +
coord_flip() +
theme_bw() +
labs(x = "Team", y = "Number of injuries", title = "Injuries by Team in the NBA") +
theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"))
The first graph shows the number of injuries per calendar year in the NBA from 2010 to 2020. The second graph shows the number of injuries that took place in each month of the 2012 NBA season.
it1 <- injuries %>%
count(year) %>%
ggplot(aes(x=year, y=n, group = 1)) +
ylim(0,4000) +
# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
geom_line(linewidth = 1, color = "red") +
geom_point(size = 2, color = "black") +
geom_text(color = "black", aes(label = n, y=n+100)) +
theme_classic() +
labs(title = "Tracking injuries in the NBA", subtitle = "Note: 2010-2020", x = "Calendar Year", y = "Number of injuries") +
theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "white"))
# Keep only 2012 so the plot matches its subtitle; calendar year 2012 stands in for the 2012 season here
it2 <- injuries %>%
filter(year == 2012) %>%
count(month) %>%
ggplot(aes(x=month, y=n, group = 1)) +
# expand_limits() keeps zero on the axis without dropping months that exceed a fixed upper limit
expand_limits(y = 0) +
geom_line(linewidth = 1, color = "red") +
geom_point(size = 2, color = "black") +
geom_text(color = "black", aes(label = n, y=n+200)) +
theme_classic() +
labs(title = "NBA Injuries by Month", subtitle = "Note: 2012 NBA season", x = "Month", y = "Number of injuries") +
theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "white"))
gridExtra::grid.arrange(it1, it2, ncol = 1)
Answer your research question using the data sets and visualizations you created.
This is important because it can help teams prevent injuries to their star players, who account for most of a team's money and brand value. It can also help athletes avoid injuries by making them more aware of the most common ones, how severe they are, and how long recovery takes. The data show that a majority of the injuries in the NBA over the last decade have been lower-body rather than upper-body injuries.
One challenge during the analysis was understanding the role rest plays, not only in the recovery process but also in causing injuries. A correlation was found between excessive rest and an increase in injuries for those players. This suggests that preventing injury also requires a baseline amount of activity; otherwise the athlete is at risk of further injury. The most common injuries were ankle injuries, with hamstring injuries second, followed by calf injuries, Achilles injuries, ruptures, and others.
Outlook: Going forward, ankle injuries appear to be the most common injury for NBA players by a wide margin, but the breakdown after that is fairly even, so it is hard to say which injuries rank next. The first dataset also records rest. This matters because injuries are sometimes misreported when the player is in fact just sitting out for rest or as a safety precaution; having the rest entries makes the overall results more accurate. The injuries that take the longest to recover from are mostly lower-body injuries, particularly in large muscle groups such as the quads and hamstrings, and in large tendons. This is interesting because there may be a correlation between how large or heavily used a muscle is in sport-specific movement and the recovery timetable when that body part is injured.
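Recovery time is not a column in either dataset, so it has to be derived. One possible approach, sketched below on made-up rows, pairs each Relinquished entry with the same player's next Acquired entry and takes the difference in days. This rests on the assumption that an Acquired row marks the player's return from the injury in question; the function name `days_missed` and the toy records are hypothetical, and this is not necessarily how the figures discussed above were obtained.

```r
# Toy records in the same shape as the injuries data (all values are made up).
toy <- data.frame(
  Date = as.Date(c("2015-01-01", "2015-01-11", "2015-02-01", "2015-03-03", "2015-03-10")),
  Acquired = c(NA, "Player A", NA, "Player B", NA),
  Relinquished = c("Player A", NA, "Player B", NA, "Player A"),
  Notes = c("sprained left ankle", "returned to lineup",
            "strained right hamstring", "returned to lineup",
            "sprained right ankle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: days between each injury and the player's next return.
days_missed <- function(df) {
  out  <- df[!is.na(df$Relinquished), c("Date", "Relinquished", "Notes")]
  back <- df[!is.na(df$Acquired), c("Date", "Acquired")]
  out$days_out <- vapply(seq_len(nrow(out)), function(i) {
    later <- back$Date[back$Acquired == out$Relinquished[i] & back$Date > out$Date[i]]
    # No later return on record: leave the duration missing rather than guess.
    if (length(later) == 0) NA_real_ else as.numeric(min(later) - out$Date[i])
  }, numeric(1))
  out
}

res <- days_missed(toy)
print(res)
```

Averaging days_out by injury type would then give a rough recovery-time ranking. One limitation of this pairing is that two injury entries in a row for the same player, with no return in between, are both matched to the same return date.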